An artificial neural network approach for sentence boundary disambiguation in urdu language text

Authors

  • Shazia Raj
  • Zobia Rehman
  • Sonia Rauf
  • Rehana Siddique
  • Waqas Anwar
Abstract

Sentence boundary identification is an important step for text processing tasks, e.g., machine translation, POS tagging, and text summarization. In this paper, we present an approach comprising a Feed Forward Neural Network (FFNN) along with part-of-speech information for the words in a corpus. The proposed adaptive system has been tested after training it with varying sizes of data and threshold values. The best results our system produced are 93.05% precision, 99.53% recall and

Similar resources

Challenges in Urdu Text Tokenization and Sentence Boundary Disambiguation

Urdu is a morphologically rich language whose characters differ in nature. Urdu text tokenization and sentence boundary disambiguation are difficult compared to a language like English. The major hurdle for tokenization is the improper use of space between words, whereas the absence of case discrimination makes sentence boundary detection a difficult task. In this paper some issues regarding b...

A hybrid approach for urdu sentence boundary disambiguation

Sentence boundary identification is a preliminary step in preparing a text document for Natural Language Processing tasks, e.g., machine translation, POS tagging, and text summarization. We present a hybrid approach for Urdu sentence boundary disambiguation comprising a unigram statistical model and a rule-based algorithm. After implementing this approach, we obtained 99.48% precision, 86.3...

Adaptive Sentence Boundary Disambiguation

Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that us...

Quantum Neural Network based Parts of Speech Tagger for Hindi

Parts of speech disambiguation in corpora is one of the most challenging areas in Natural Language Processing. However, some works have been done in the past to overcome the problem of bilingual corpora disambiguation for Hindi using Hidden Markov Model and Neural Network. In this paper, Quantum Neural Network (QNN) for Hindi parts of speech tagger has been used. To analyze the effectiveness of the proposed ...

Compound Sentence Segmentation and Sentence Boundary Detection in Urdu

The raw Urdu corpus comprises irregular and large sentences which need to be properly segmented in order to make them useful in Natural Language Engineering (NLE). This makes Compound Sentences Segmentation (CSS) a timely and vital research topic. The existing online text processing tools are developed mostly for computationally developed languages such as English, Japanese and Spanish etc...

Journal title:
  • Int. Arab J. Inf. Technol.

Volume 12  Issue 

Pages  -

Publication date 2015